HIGHLIGHTS 

• Value-added measures are positively related to almost all other commonly accepted measures of 
teacher performance such as principal evaluations and classroom observations. 

• While policymakers should consider the validity and reliability of all their measures, we know 
more about value-added measures than about the others. 

• The correlations appear fairly weak, but this is due primarily to lack of reliability in essentially all 
measures. 

• The measures should yield different performance results because they are trying to measure 
different aspects of teaching, but they differ also because all have problems with validity and 
reliability. 

• Using multiple measures can increase reliability; validity is also improved so long as the additional 
measures capture aspects of teaching we value. 

• Once we have two or three performance measures, the costs of more measures for accountability 
may not be justified. But additional formative assessments of teachers may still be worthwhile to 
help these teachers improve. 


INTRODUCTION 

In the recent drive to revamp teacher evaluation and accountability, measures of a teacher's value 
added have played the starring role. But the star of the show is not always the best actor, nor can the 
star succeed without a strong supporting cast. In assessing teacher performance, observations of 
classroom practice, portfolios of teachers' work, student learning objectives, and surveys of students 
are all possible additions to the mix. 

All these measures vary in what aspect of teacher performance they measure. While teaching is 
broadly intended to help students live fulfilling lives, we must be more specific about the elements of 
performance that contribute to that goal - differentiating contributions to academic skills, for 
instance, from those that develop social skills. Once we have established what aspect of teaching we 
intend to capture, the measures differ in how valid and reliable they are in capturing that aspect. 

Although there are big holes in what we know about how evaluation measures stack up on these two 
criteria, we can draw some important conclusions from the evidence collected so far. In this brief, we 
will show how existing research can help district and state leaders who are thinking about using 
multiple measures of teacher performance to guide them in hiring, development, and retention. 


WHAT DO WE KNOW ABOUT HOW ALTERNATIVE MEASURES OF TEACHER 
EFFECTIVENESS COMPARE? 

The simplest way to judge how well various measures compare with each other is to calculate a 
"correlation," a statistic that indicates the extent to which two numbers move in tandem. When two 
measures are unrelated to one another the correlation is 0.0. When two measures are perfectly 
correlated the correlation is 1.0. 1 
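
As a minimal sketch of how such a statistic is computed, the Python snippet below calculates the standard Pearson correlation for two lists of scores. The function name and the five teachers' scores are made up for illustration; they are not drawn from any study cited here. Negative values, down to -1.0, are also possible when two measures move in opposite directions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Co-movement of the two measures around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Spread of each measure around its own mean.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five teachers on two measures (invented numbers).
value_added = [0.10, -0.05, 0.30, 0.00, -0.20]
observation = [3.1, 2.8, 3.0, 3.4, 2.6]
print(round(pearson(value_added, observation), 2))
```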

The Measures of Effective Teaching (MET) study, a much-cited study funded by the Gates 
Foundation, 2 finds correlations of 0.12 to 0.34 between value-added measures and classroom 
observation rubrics such as the Danielson Framework. The connection is nearly identical to the 
correlations that prior studies have found between value-added measures and confidential low-stakes 
evaluations of teachers by their principals. 3 MET found stronger relationships between value-added 
measures and student surveys of teacher practice. 4 Although students certainly are not expert judges 
of effective teaching, they are with teachers every day, and it is their performance on standardized 
tests that ultimately determines a teacher's value added. 

To put these numbers into perspective, I created two hypothetical performance measures and placed 
100 teachers into one of four equally-sized performance categories, "A" being high and "D" being low. 
Correlations of 0.2, such as those above, are consistent with 32 teachers remaining in the same 
categories under both measures (e.g., performance level A in both cases), 44 teachers switching by 
one performance level, 18 switching by two levels (e.g., A to C), and 6 teachers switching from the top 
to the bottom or vice versa. 
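
A thought experiment of this kind can be reproduced with a small simulation. The sketch below is not the calculation behind the figures above; it assumes both measures are normally distributed with a true correlation of 0.2, sorts simulated teachers into four equal-sized categories on each measure, and tallies how many categories apart the two ratings fall. The function name, sample size, and seed are arbitrary choices.

```python
import random
from collections import Counter

def simulate_category_shifts(n_teachers=40_000, rho=0.2, seed=0):
    """Share of teachers whose two category ratings differ by 0, 1, 2, or 3 levels."""
    rng = random.Random(seed)
    # Assumption: scores on both measures are standard normal with correlation rho.
    first = [rng.gauss(0, 1) for _ in range(n_teachers)]
    second = [rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for x in first]

    def quartile_categories(scores):
        # Rank teachers and cut the ranking into four equal-sized groups (0..3).
        order = sorted(range(len(scores)), key=scores.__getitem__)
        categories = [0] * len(scores)
        for rank, teacher in enumerate(order):
            categories[teacher] = rank * 4 // len(scores)
        return categories

    cat_first = quartile_categories(first)
    cat_second = quartile_categories(second)
    gaps = Counter(abs(a - b) for a, b in zip(cat_first, cat_second))
    return {gap: gaps[gap] / n_teachers for gap in range(4)}

shares = simulate_category_shifts()
for gap, share in shares.items():
    print(f"{gap} level(s) apart: {share:.0%}")
```

Under these assumptions the simulated shares follow a pattern similar to the counts above, though they need not match exactly, since the result depends on the assumed distribution of scores.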

Another measure that is positively related to value-added measures is a teacher's overall score from 
the National Board for Professional Teaching Standards (NBPTS). 5 Not surprisingly, the individual 
components of that score— videos of instruction, teacher portfolios, and standardized tests— are also 
positively related to value-added estimates. All of this evidence reinforces the general conclusion that 
almost all the other measures now being considered are positively correlated with value-added 
measures. 6 

How should we interpret these correlations? Are they strong enough to justify the use of value-added 
measures? At least one report has criticized teacher value-added measures because they do not line 
up closely enough with other measures. 7 Certainly, the level of agreement is not very high. Below, we 
consider why that is and what the implications are. 

Validity, reliability, and classification errors 

When selecting performance measures, it is important to establish criteria for evaluating them and to 
apply the same criteria to all. When researchers talk about "accurate" measures of teacher performance, 
they mean measures that are valid and reliable. Validity refers to the degree to which 
something measures what it claims to measure. Reliability refers to the degree to which the measure 
is consistent when repeated. A measure could be valid on average, but inconsistent when repeated. 
Conversely, a measure could be highly reliable but invalid— that is, it could consistently provide the 
same invalid information. 

Validity is closely related to the idea of bias. We tend to think of bias as arising in subjective decisions. 
Someone who works at the Ford Motor Company might say, "I love Ford cars, but I work there, so I'm 
biased," which is to say that the person's views might not correspond to the objective qualities of the 
car. Likewise, with teacher evaluation, if the observer knows the teacher personally, we might worry 
that she is hampered in her ability to objectively judge that teacher's performance. Also, some school 
principals and other observers may simply not know what to look for when assessing classroom 
practice; this, too, can introduce bias. 

Value-added measures can also be biased, but in a somewhat different way. A common criticism of 
value-added measures is that some teachers are at a disadvantage because they are assigned students 
who are more difficult to educate, even after the measures account for students' prior test scores; this 
is what researchers call selection bias. No matter how many times we calculate value-added for these 
teachers, this form of bias means the results will still be invalid. 8 

Serious consequences arise when measures are not valid and reliable: they increase classification 
errors - the placement of teachers into incorrect performance categories. In Florida, the District of 
Columbia, and a growing number of states and districts, performance measures, including value- 
added estimates, play a key role in placing teachers in performance categories. Being misclassified as 
unsatisfactory could mean losing a job. 

Evidence on validity and reliability 

The evidence on the validity and reliability of value-added estimates is evolving and may be 
misunderstood. 9 Strictly speaking, establishing that a measure is valid requires comparing it to a 
"true" measure that we know to be correct, but this is essentially impossible with teaching. Instead, 
researchers try to establish validity indirectly. One widely publicized study, which created a statistical 
test of the validity of value-added measures, 10 found reason for concern. Several subsequent studies 
suggest that the measures are probably reasonably valid on average. 11 Another study randomly 
assigned students to teachers and found that selection bias was only a small problem, although there 
is debate about the interpretation of this result. 12 

It is important to note, however, that even if the conclusions from these studies are right, they 
provide evidence about whether value-added measures are valid on the average across large numbers 
of teachers. 13 They could still be— and apparently are— invalid for specific subgroups of teachers. For 
example, one study, which points out that almost all the evidence about validity is based on studies in 
elementary schools, provides evidence that typical value-added measures are biased in middle and 
high school. 14 Another study suggests that teachers whose students start off with very high 
achievement will receive lower performance ratings than they deserve because of the "test ceiling". 15 
Value-added measures are probably also highly sensitive to the context of teachers' classrooms, 
including behavioral issues and the school culture. 16 The assumptions underlying value-added models 
have also been shown to be false, 17 which may influence different teachers in different ways. In 
general, a measure cannot be considered valid if it is heavily influenced by factors that are outside the 
control of teachers. So these findings raise important concerns. 18 

Reliability is also considered a significant issue with value-added measures. As with correlations, the 
highest possible reliability measure is 1.0, which means that the performance measure does not 
change at all over time. 19 At the other extreme, reliability of zero means that any given measure of 
teacher performance tells us nothing about what the measure will be for the same teacher the next 
time. 20 When creating student tests, for example, designers usually set a standard of at least 0.9. 21 
The MET study reports reliability for teacher value-added measures of about 0.3 to 0.5 when three 
years of data are used. 22 To put this in perspective, one study finds that only 28 to 50 percent of 
teachers who were ranked in the top fifth on value-added measures one year were still ranked in the 
top fifth in the subsequent year, and 4 to 15 percent of teachers switched from the top fifth to the 
bottom fifth. 23 

CarnegieKnowledgeNetwork.org 

HOW DO VALUE-ADDED INDICATORS COMPARE TO OTHER MEASURES OF TEACHER EFFECTIVENESS? 

Critics of value-added measures might stop right here and contend that the limited reliability argues 
against using value added at all. But here again, we have to compare the alternatives. A single 
classroom observation has lower reliability than a value-added measure, but a combination of four 
classroom observations yields a higher reliability of about 0.65. 24 The validity of classroom observation 
measures is less clear. The MET study involved highly trained observers who had literally passed an 
exam demonstrating their skill; this level of training is unlikely in everyday school settings. Older 
evidence suggests that classroom observations can be influenced by factors unrelated to 
performance, such as age and race. Also, it seems likely that classroom context will affect observation 
measures just as it appears to affect value-added measures. For example, it may be difficult to make 
valid comparisons between the classroom management skills of a teacher who has emotionally 
impaired students and faces frequent disruptions and those of a teacher whose students are less 
disruptive. 
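
To illustrate how averaging repeated observations can raise reliability, the sketch below uses the Spearman-Brown formula from classical test theory. This is an assumption for illustration: the formula treats the observations as interchangeable with independent errors, and it is not necessarily how the MET study derived its figure. The function names are hypothetical.

```python
def spearman_brown(single_reliability, k):
    """Reliability of the average of k parallel measurements."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)

def implied_single_reliability(combined_reliability, k):
    """Invert the formula: reliability of one measurement implied by a k-fold average."""
    return combined_reliability / (k - (k - 1) * combined_reliability)

# Under these assumptions, a reliability of 0.65 for four observations
# implies a single-observation reliability of roughly 0.32.
single = implied_single_reliability(0.65, 4)
print(round(single, 2))

# Reliability keeps rising with more observations, but by less each time.
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(single, k), 2))
```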

On other evaluation measures, we have almost no evidence about validity and reliability. With the 
measure known as student learning objectives (SLOs), teachers work with their instructional leaders to 
identify student needs, create specific objectives, and establish metrics based on student work to 
gauge progress toward those objectives. This process combines the outcomes orientation of 
student test scores with the more subjective elements of classroom observations. While these kinds 
of evaluations are somewhat similar to the teacher portfolios used in NBPTS, they are difficult to 
assess. 

Most of the evidence cited above involves low-stakes measures, but the validity and reliability of all 
the measures is likely to be influenced by the stakes attached to them. 25 When a district uses value- 
added measures, some teachers might try to "game the system" and get assigned to students who are 
most likely to make achievement gains or to move to schools where the biases seem to give teachers 
an advantage. Alternatively, high stakes might cause classroom observers to take their task more 
seriously — to be more careful in their work. 

Measures that allow for more subjectivity and local control, such as classroom observations and SLOs, 
are subject to their own types of bias. If classroom observers have personal relationships with 
teachers they may, however unintentionally, help their friends. Likewise, if teachers can largely set 
their own objectives with SLOs, they may set the bar low to ensure that they appear successful. 

SLOs are potentially attractive because they can be used in all classrooms, allow local autonomy, and 
fit well with customized instruction. Some view the autonomy as a weakness because the measures 
are unlikely to be comparable across teachers and may be too easily manipulated to give the 
appearance of high performance. In any event, there is essentially no evidence about the validity or 
reliability of SLOs. 

From validity and reliability to practicality 

In addition to questions of validity and reliability, it is worth briefly considering the practicality and 
costs of the various evaluation measures. Value-added measures, in most districts, can be used only 
with the roughly one-third of teachers who are in tested grades and subjects and who have at least two 
years of data. This necessitates some other approach with other teachers. On the other hand, value- 
added measures are fairly inexpensive once the testing regime is in place. Also, while some have 
criticized the complexity of value-added measures, at least one handbook for SLOs is nearly 60 pages 
long, and classroom observations can involve more than 100 sub-measures. A complete comparison of 
multiple measures requires that these practical considerations be accounted for as well. 


WHAT MORE NEEDS TO BE KNOWN ON THIS ISSUE? 

We know much more about value-added measures than we do about other evaluation methods, so 
clearly we need more research on the latter, as well as additional research on value-added measures 
to determine how valid they are for particular groups of teachers. The limited evidence is a big 
problem; it means that even what we think we know about value added does not get us very far in 
deciding what to do with it. Without conducting similar analyses on the other measures, we can't 
compare alternatives and choose the best options. 

More research on other measures would also help us understand why the correlations among them 
are so modest— why they differ as much as they do. The first reason they differ is simply that they 
each really capture a different notion of teacher performance; they should be different. For 
example, we have every reason to believe that principals care about students' academic achievement 
more than anything else. In one study, principals' assessments of overall teacher performance and 
their assessment of teacher contributions to student achievement are correlated at about 0.7, very 
high. 26 However, principals rank a "caring" disposition as one of the most important teacher traits. 27 
Clearly, a principal who cares mainly about academic achievement thinks about teacher performance 
differently than one who prefers a caring personality. 

If each measure were valid and reliable, the correlations would no doubt be much higher. But even 
then, the correlations would still be less than 1.0 for two reasons. First, one measure might be less 
valid than another, even when the intended notion of teacher performance is the same. Second, the 
maximum correlation is roughly equal to the reliability of the two measures (strictly, the square root of the product of their reliabilities), which is generally much 
less than 1.0. 28 For example, two measures with reliabilities of 0.5 (which seems realistic given the 
above measures) have a maximum correlation of 0.5. 
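
The arithmetic behind this ceiling can be sketched as follows. The snippet assumes the classical test theory result that the largest possible correlation between two measures is the square root of the product of their reliabilities, with measurement errors that are independent of each other. The function names are illustrative.

```python
from math import sqrt

def max_correlation(reliability_a, reliability_b):
    """Ceiling on the observed correlation between two imperfectly reliable measures."""
    return sqrt(reliability_a * reliability_b)

def disattenuated(observed_r, reliability_a, reliability_b):
    """Observed correlation expressed as a fraction of the ceiling."""
    return observed_r / max_correlation(reliability_a, reliability_b)

# Two measures that each have reliability 0.5 cannot correlate above 0.5.
print(round(max_correlation(0.5, 0.5), 2))

# An observed correlation of 0.2 is then 40 percent of the maximum possible.
print(round(disattenuated(0.2, 0.5, 0.5), 2))
```

Read this way, a correlation that looks weak on a 0-to-1 scale may be considerably stronger relative to what unreliable measures allow.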

These examples are largely hypothetical because we lack evidence on the validity and reliability of 
measures other than value-added. But the evidence does suggest that the main reason the measures 
differ is that each measure is unreliable. This is an important lesson because there are steps that can 
be taken to increase reliability, such as increasing the number of classroom observations and years of 
data used in value-added calculations. 


WHAT CAN'T BE RESOLVED BY EMPIRICAL EVIDENCE ON THIS ISSUE? 

While choices about the mix of measures should be made partly based on evidence, they also require 
value judgments. We have to decide first what aspects of teaching we value. Are we more concerned 
about students obtaining academic skills or social skills or creativity? Choosing the right mix of 
measures therefore depends on what we think schools should be trying to achieve. A valid measure of 
teacher performance is one designed to capture how well teachers contribute to the student 
outcomes we value most. On this, there are legitimate differences of opinion. 


PRACTICAL IMPLICATIONS 

Does this issue impact district decision making? 

There is wide support for using more measures in addition to value added to make high-stakes 
decisions. If a multiple-measures approach helps create a composite that is more representative of 
what stakeholders value, then validity improves. Using multiple measures also improves reliability, up 
to about 0.65 in the MET study. 29 This figure is still below conventional levels in educational 
assessment, but it is better than the alternative of a single measure. 

But we can go further in thinking about how many, and which, measures should be used. Basic 
economic theory provides a useful perspective on multiple measures. First, economics recognizes that 
quality teacher evaluation is expensive and time-consuming. To observe teachers in class, for 
example, principals must take time away from other duties, and some of the best teachers must be 
pulled from the classroom to evaluate others. What matters is not just how many measures are used, 
but how much information is collected with each. 

Second, basic economics suggests that when two measures are highly correlated, there is not much 
point in using both of them. This issue might seem moot since none of the measures are highly 
correlated, but the same principle applies. It's just that we have to interpret "highly correlated" based 
on the maximum correlation possible. 

Yet many states and districts are considering using three or more measures. So the question then 
becomes: how much additional information would a third or fourth measure bring? The answer 
depends on the reliability of the additional measure, as well as how the random error in the additional 
measure correlates with the random errors in the other measures. In general, additional measures will 
increase both validity and reliability, but at some point the additional gain is not worth the cost. 

Austin, Texas schools use 13 measures to evaluate teachers - a costly strategy that may confuse 
teachers about what they are supposed to be aiming for. States and districts can test the worth of 
adding more measures by calculating the correlations between simpler and more complex composite 
measures. If the correlations are very high, it might indicate that the additional measures are not 
worthwhile. 
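
A minimal sketch of that check: standardize each measure, average the standardized scores into a composite, and correlate the simpler composite with the fuller one. All teacher scores below are invented, equal weighting is an assumption (real systems often weight measures differently), and the function names are illustrative.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def composite(*measures):
    """Equal-weight average of standardized measures, one value per teacher."""
    z_scores = [standardize(m) for m in measures]
    return [mean(per_teacher) for per_teacher in zip(*z_scores)]

def correlation(x, y):
    """Pearson correlation, computed as the mean product of standardized scores."""
    return mean(a * b for a, b in zip(standardize(x), standardize(y)))

# Hypothetical scores for six teachers on three measures (invented numbers).
value_added = [0.10, -0.05, 0.30, 0.00, -0.20, 0.15]
observation = [3.1, 2.8, 3.0, 3.4, 2.6, 3.2]
student_survey = [4.0, 3.5, 4.2, 3.9, 3.6, 3.8]

two_measures = composite(value_added, observation)
three_measures = composite(value_added, observation, student_survey)
print(round(correlation(two_measures, three_measures), 2))
```

A value close to 1.0 would suggest that the third measure changes few teachers' relative standing in the composite.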

The economics-based approach, however, focuses on so-called summative performance measures 
that evaluators use to make high-stakes decisions about teachers' salaries and careers. Organizations 
also need formative information to help teachers improve; they need indicators of a teacher's specific 
skills in classroom management, for instance, or her ability to provide meaningful feedback to 
students. Both types of measures are important. 30 So, even if an additional measure gives evaluators 
little in the way of summative information, it may be quite valuable for the formative information it 
provides. 

Performance measures are the lynchpin of teacher evaluation systems. The choice of measures is 
therefore the crucial first decision for administrators developing any system of teacher improvement 
and accountability. We have learned a great deal about the strengths and weaknesses of one of those 
measures - value added - but we need to know much more about the others. After all, we can't 
decide how best to use value-added measures without determining how the other measures 
compare. So far, the modest correlations we see imply that different evaluation measures will yield 
different results for the same teacher. We can reduce these classification errors by using multiple 
measures to improve validity and reliability, and by creating additional checks and balances when 
making high-stakes decisions. We can never eliminate classification errors, but we can reduce them. 